Abstract

We calculate the mutual information (MI) of a two-layered neural network with 
noiseless, continuous inputs and binary, stochastic outputs under several assumptions 
on the synaptic efficiencies. The interesting regime corresponds to the limit where the
number of both input and output units is large but their ratio is kept fixed at a value
$\alpha$. We first present a solution for the MI using the replica technique with a replica
symmetric (RS) ansatz. Then we find an exact solution for this quantity valid in a
neighborhood of $\alpha = 0$. An analysis of this solution shows that the system must have
a phase transition at some finite value of $\alpha$. This transition shows a singularity in the
third derivative of the MI. As the RS solution turns out to be infinitely differentiable,
it could be regarded as a smooth approximation to the MI. This is checked numerically
in the validity domain of the exact solution.
1 Introduction



The aim of this work is to study the properties of a binary communication channel
processing data from a Gaussian source, when the output state is stochastic. The architecture is
a two-layered feedforward neural network with N analogue input units and P binary output
units. The mutual information (MI) is evaluated in the large N limit with $\alpha = P/N$ fixed.
Research in this direction was previously done in [1], where the case of a noiseless binary
channel was studied, and in [2], which dealt with the case of a Gaussian source corrupted
with input noise.

The main motivation of this work is a technical one. In [1] and [2] the MI of binary
channels was obtained by means of the replica technique and the replica symmetry ansatz
[3]. However, there have been no attempts to show the validity of this solution. In this paper
we give an analytical solution of the MI of the channel without making use of the replica 
technique. In order to compare both methods, the replica symmetry ansatz (RSA) solution 
of a general stochastic binary channel is also evaluated. While the RSA yields an expression
of the MI for all values of $\alpha$, the exact analytical solution turns out to be valid only up
to some $\alpha = O(1)$. However, our conclusion is that the correct solution is the analytical
one and that there is a (possibly large order) phase transition located at the value of $\alpha$
where the analytical solution ceases to be valid. The RSA solution has to be regarded just
as a smooth approximation to the MI, interpolating between the correct small and large $\alpha$
regimes.

There are several other motivations for doing this investigation. Once the MI of the 
channel is known, the problem of extracting as much information as possible from the inputs 
can be addressed. This optimization problem leads to interesting data analysis. Optimizing 
the MI, a criterion known as the "infomax" principle [4], is a way of unsupervised learning
(see, e.g., [5]). The parameters of the model (that is, of the channel) adapt according to this
principle and in this way they learn the statistics of the environment (that is, of the source). 
One can also optimize the MI by adapting the transfer function itself |], . Another form of 
this type of unsupervised learning is the minimum redundancy criterion [8]. Both have been
used to predict the receptive fields of the early visual system [9, 10, 11, 12]. The relation
between them has been discussed in ref. [7]. Another motivation is that learning how to solve
this particular non-linear channel could provide the techniques to deal with other types of
non-linearities. Little is known about the properties of systems other than linear, except for
threshold-linear networks [13, 14, 15] (treated with the replica technique), approximations
for weak non-linear terms in the processing [16], some general properties of the low and
large noise limits [0, 17] and an analytical treatment of binary communication channels,
either noiseless [1] or with input noise [2].

The paper is organized in the following way: The model is explained in the next Section,
where the notation and the relevant quantities are also given. In Section 3 the exact
calculation of one of the contributions to the MI (the "equivocation" term, see [18]) is presented.
In Section 4 the evaluation of the other contribution to the MI (the entropy term) is discussed.
In its first subsection, this is done by the usual replica technique. The exact solution is
obtained in the second subsection. In Section 5 the RS expression of the MI is analysed in
several asymptotic regimes. A numerical analysis of the RSA solution is also presented at
the end of this section. The comparison between the exact and the RSA solutions is done
in Section 6. A discussion about the existence of a phase transition is given in Section 7.
Section 8 is devoted to the analysis of the Replica Symmetry Breaking (RSB) solutions. The
conclusions are contained in Section 9. Finally, several technical questions are presented in
the Appendices.



2 The model 

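We consider a two-layered neural network with N inputs $\xi \in \mathbb{R}^N$ and P binary outputs
$v \in (Z_2)^P$, that is, $v_i = \pm 1$. The input vector $\xi$ is distributed as a Gaussian with zero mean and covariance
matrix $C \in M_{N\times N}(\mathbb{R})$:

$$p(\xi) = \frac{e^{-\frac{1}{2}\,\xi^{t} C^{-1}\xi}}{\sqrt{\det(2\pi C)}}. \tag{1}$$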

The feedforward connections are denoted by the matrix $J \in M_{P\times N}(\mathbb{R})$ and its matrix
elements by $\{J_{ij}\}$ (i = 1, ..., P; j = 1, ..., N). Instead of considering a fixed matrix we prefer
to compute the average MI over an ensemble of stochastic binary channels. The $\{J_{ij}\}$ also have
a Gaussian distribution, with zero mean and two-point correlations $\Gamma$:

$$\langle\langle J_{ij} J_{i'j'} \rangle\rangle = \delta_{ii'}\,\Gamma_{jj'}, \tag{2}$$

where the double angular brackets indicate the average over the channel ensemble and
$\Gamma \in M_{N\times N}(\mathbb{R})$. Notice that connections converging to different outputs are independently
distributed. The coupling matrix J can also be regarded as P random N-dimensional vectors
$J_i$ (i = 1, ..., P) given by the rows of J. From eq. (2), each of these rows is
distributed independently as:

$$p(J_i) = \frac{e^{-\frac{1}{2}\,J_i^{t}\,\Gamma^{-1} J_i}}{\sqrt{\det(2\pi\Gamma)}}. \tag{3}$$

Let us now define the local field as $h = J\xi$. The output state is computed by means
of the probability distribution $p(v|h)$ that the output vector is v for a given local field
h. Most interesting problems have a factorized output distribution; we will therefore assume
that $p(v|h) = \prod_{i=1}^{P} p(v_i|h_i)$. We will also require the reasonable condition that
$p(v_i|h_i) = f_\beta(h_i v_i)$ ($\beta$ denotes an output noise parameter), where $f_\beta(x)$ is an arbitrary function
satisfying $0 \le f_\beta(x) \le 1$ and $f_\beta(x) + f_\beta(-x) = 1$.

Both the replica and the exact calculations will be done for any such f(x). However,
we will study in some detail the case (to be referred to as the Hyperbolic Tangent Transfer
function, HTT):

$$f(x) = \frac{e^{\beta x}}{e^{\beta x} + e^{-\beta x}} = \frac{1}{2}\left(1 + \tanh(\beta x)\right). \tag{4}$$

The deterministic channel [1] is obtained either when f(x) is chosen as the Heaviside
function $\theta$ or in the large $\beta$ limit of the HTT function.
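The channel defined above can be sketched in a few lines. This is only an illustration under stated assumptions: the covariances are taken as $C = \mathrm{Id}$ and $\Gamma = \mathrm{Id}/N$ (so that $\mathrm{Tr}(\Gamma C) = 1$), and the names `f_htt` and `sample_output` are hypothetical helpers, not part of the original work.

```python
import math
import random

def f_htt(x, beta):
    """HTT transfer function of eq. (4): probability that v_i * h_i has the sign of x."""
    return 0.5 * (1.0 + math.tanh(beta * x))

def sample_output(xi, J, beta, rng):
    """Draw v in {-1,+1}^P given an input xi and couplings J (a list of P rows)."""
    v = []
    for row in J:
        h = sum(j * x for j, x in zip(row, xi))  # local field h_i = J_i . xi
        # p(v_i = +1 | h_i) = f(h_i), hence p(v_i = -1 | h_i) = f(-h_i) = 1 - f(h_i)
        v.append(1 if rng.random() < f_htt(h, beta) else -1)
    return v

rng = random.Random(0)
N, P, beta = 8, 4, 2.0
# Assumption: C = Id and Gamma = Id / N, so each local field has unit variance.
xi = [rng.gauss(0.0, 1.0) for _ in range(N)]
J = [[rng.gauss(0.0, 1.0 / math.sqrt(N)) for _ in range(N)] for _ in range(P)]
v = sample_output(xi, J, beta, rng)
```

Increasing `beta` makes `f_htt` approach the Heaviside function, which recovers the deterministic channel.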



Let us now define the mutual information $I(v, \xi|J)$ [18, 19, 20] between the input and
output vectors, given the channel parameters J:¹

$$I(v,\xi|J) = \sum_{v}\int d^N\xi\; p(v,\xi|J)\,\log\frac{p(v,\xi|J)}{p(v|J)\,p(\xi)}, \tag{5}$$

where $p(v|J)$ is the output vector distribution given J. The joint probability $p(v,\xi|J)$ can
be written as

$$p(v,\xi|J) = p(\xi)\,p(v|\xi,J), \tag{6}$$

where $p(v|\xi,J)$ denotes the conditional distribution of the output vector v given the input $\xi$
and the channel J. Since the relation between the input and the local field h is deterministic,
it can be substituted by $p(v|h)$.

First we need to define the output entropy for fixed couplings J

$$H(v|J) = -\sum_{v} p(v|J)\,\log p(v|J), \tag{7}$$

and the entropy of the output conditioned by the input $\xi$, again for fixed couplings (the
equivocation term):

$$H(v|\xi,J) = -\int d^N\xi\; p(\xi)\sum_{v} p(v|\xi,J)\,\log p(v|\xi,J). \tag{8}$$

Then the MI can be expressed as

$$I(v,\xi|J) = H(v|J) - H(v|\xi,J). \tag{9}$$

We are interested in the mutual information $I = \langle\langle I(v,\xi|J)\rangle\rangle$ averaged over the channel
ensemble. Then, calling $I_1 = \langle\langle H(v|J)\rangle\rangle$ and $I_2 = \langle\langle H(v|\xi,J)\rangle\rangle$, we have $I = I_1 - I_2$, or
in terms of MI per input unit, $i = i_1 - i_2$, where $i_1 = I_1/N$ and $i_2 = I_2/N$.

Each term will be studied separately. In the next section we compute the equivocation 
term I2. This can be obtained exactly by means of simple arguments. The output entropy 
term requires more care. It will be evaluated in section 4.1 using the replica technique and
in section 4.2 using exact analytical methods.



¹ In eq. (5) and hereafter $\log(x) = \ln(x)/\ln 2$. In the derivations, however, we make use of $\ln(x)$ because of the
Taylor expansions.
3 The equivocation term $I_2$

Since $p(v|\xi,J)$ factorizes, it is convenient to define the single output entropy, $\mathcal{H}(h_i)$,
which is a function of $h_i$:

$$\mathcal{H}(h_i) = -\sum_{v_i=\pm 1} p(v_i|h_i)\,\ln p(v_i|h_i), \tag{10}$$

where one should keep in mind that $h = J\xi$. Then:

$$H(v|\xi,J) = \sum_{i=1}^{P}\int d^N\xi\; p(\xi)\,\mathcal{H}(h_i) \tag{11}$$

and

$$I_2 = \sum_{i=1}^{P}\int dJ\,p(J)\int d^N\xi\; p(\xi)\,\mathcal{H}(h_i). \tag{12}$$

$I_2$ can be easily evaluated in several simple examples. In the deterministic case it is zero.
In the large noise limit ($\beta \to 0$) it reaches its upper bound $I_2 = P\ln 2$. For the HTT function,
eq. (4), we have $\mathcal{H}(h) = \ln(e^{\beta h} + e^{-\beta h}) - \beta h\tanh(\beta h)$, and this single output entropy can
be substituted in eq. (12) to obtain $I_2$. In Appendix A the details of this calculation are
presented; here we recall the final formula, eq. (58), valid for a matrix $\mathbf{M} = \Gamma C$ having all
its eigenvalues of the same order and any function $\mathcal{H}(h)$. The equivocation term per input
unit is then:

$$i_2 = \alpha\int_{-\infty}^{\infty} dz\,\frac{e^{-z^2}}{\sqrt{\pi}}\;\mathcal{H}\!\left(\sqrt{2M}\,z\right), \tag{13}$$

with $M = \mathrm{Tr}(\mathbf{M})$ (the trace of $\mathbf{M}$). For the sigmoidal HTT function, we obtain:

$$i_2 = \alpha\sum_{m=1}^{\infty}(-1)^{m+1} A_m, \tag{14}$$

where

$$A_m = \frac{2\beta_0}{\sqrt{\pi}} + \left(\frac{1}{m} - 2m\beta_0^2\right)\left(1 - \mathrm{erf}(m\beta_0)\right)e^{m^2\beta_0^2} \tag{15}$$

and $\beta_0 = \sqrt{2M}\,\beta$, which we shall call the reduced noise parameter. Alternatively, we will
also use the reduced temperature $T_0 = \beta_0^{-1}$. The symbol erf(x) stands for the error function.
One can easily obtain several limits. For small $T_0$:

$$i_2 \approx \frac{\pi^{3/2}}{6}\,\alpha\,T_0, \tag{16}$$

where one observes that $i_2$ grows linearly with $T_0$. For small $\beta_0$,

$$i_2 \approx \alpha\ln 2 - \frac{\alpha}{4}\,\beta_0^2, \tag{17}$$

in this case $i_2$ departs from $\alpha\ln 2$ quadratically with $\beta_0$.
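The two limits above can be checked numerically. The sketch below evaluates the integral of eq. (13) for the HTT function with a plain trapezoidal rule and compares it with eqs. (16) and (17); the function names, the integration cutoff and the number of steps are illustrative choices, not taken from the original calculation.

```python
import math

def single_output_entropy(u):
    """ln(e^u + e^-u) - u*tanh(u) in nats, rewritten so that large |u| does not overflow."""
    t = abs(u)
    e = math.exp(-2.0 * t)
    return math.log1p(e) + 2.0 * t * e / (1.0 + e)

def i2_over_alpha(beta0, zmax=8.0, steps=20000):
    """Trapezoidal estimate of eq. (13) divided by alpha, using beta * h = beta0 * z."""
    dz = 2.0 * zmax / steps
    total = 0.0
    for j in range(steps + 1):
        z = -zmax + j * dz
        w = 0.5 if j in (0, steps) else 1.0
        total += w * math.exp(-z * z) * single_output_entropy(beta0 * z)
    return total * dz / math.sqrt(math.pi)

small_beta = i2_over_alpha(0.1)   # to be compared with ln 2 - beta0^2 / 4
small_temp = i2_over_alpha(50.0)  # to be compared with pi^{3/2} T0 / 6, with T0 = 1/50
```

Both values agree with the asymptotic formulas up to the neglected higher-order terms in $\beta_0$ and $T_0$, respectively.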

For other transfer functions one obtains the same qualitative result, although with dif- 
ferent coefficients. This is simply because these coefficients depend on the derivatives of the 
transfer function in the neighborhood of $\beta = 0$ and $\beta = \infty$, respectively.

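4 Calculation of $I_1$

We now compute the output entropy term defined in eq. (7). First we notice that the
(discrete) probability density of the outputs v is given by:

$$p_v \equiv p(v|J) = \int d^N\xi\; p(\xi)\,p(v|\xi,J). \tag{18}$$

Since $p_v$ is a probability density, $\sum_v p_v = 1$ and so

$$I_1 = -\lim_{n\to 0}\frac{\sum_v \langle\langle p_v^{\,n+1}\rangle\rangle - 1}{n}. \tag{19}$$

$\langle\langle p_v^{\,n+1}\rangle\rangle$ can be written in terms of replicated variables $\{\xi^a\}$, a = 0, 1, ..., n in the
following way:

$$\langle\langle p_v^{\,n+1}\rangle\rangle = \int dJ\,p(J)\prod_{a=0}^{n}\left[\int d^N\xi^a\; p(\xi^a)\prod_{i=1}^{P} f(v_i h_i^a)\right]. \tag{20}$$

Here the local fields are:

$$h_i^a = \sum_{j=1}^{N} J_{ij}\,\xi_j^a. \tag{21}$$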
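The identity behind eq. (19) can be illustrated on a small discrete distribution: for any normalized $p_v$, the quantity $-(\sum_v p_v^{n+1} - 1)/n$ tends to the entropy $-\sum_v p_v \ln p_v$ as $n \to 0$. The distribution `p` below is an arbitrary example chosen only for illustration, with no channel average involved.

```python
import math

def entropy(p):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

def replica_estimate(p, n):
    """-(sum_v p_v^(n+1) - 1) / n, which tends to the entropy as n -> 0."""
    return -(sum(q ** (n + 1.0) for q in p) - 1.0) / n

p = [0.5, 0.25, 0.125, 0.125]
estimates = [replica_estimate(p, n) for n in (1.0, 0.1, 0.01, 1e-6)]
```

The list `estimates` approaches `entropy(p)` as n decreases, which is the extrapolation from integer moments to $n \to 0$ used in the text.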

We will only compute the integer order moments of $p_v$. The continuous order moments
will be obtained by naive extrapolation from them. Actually they can be obtained in a
completely rigorous way (although in a rather complicated fashion) because the integer moments
contain enough information to reconstruct the probability distribution of the variable $p_v$.

It is obvious from the distribution of J that each element $J_{ij}$ is independent of the others
with a different output index i. Using this and the fact that each $J_i$ has an even distribution
(Gaussian), we have that $\langle\langle p_v^{\,n+1}\rangle\rangle$ is independent of v. Thus we can write:

$$I_1 = -\lim_{n\to 0}\frac{2^{P}\langle\langle p^{\,n+1}\rangle\rangle - 1}{n}, \tag{22}$$

where p stands for the conditional distribution $p_v$ with a specific choice of v, e.g. $v_i = +1$,
i = 1, ..., P.
In eq. (20) we can apply, for almost every J, the Bessel-Plancherel identity in each of
the integrals over the $\xi$'s:

$$\int d^N\xi^a\,\frac{e^{-\frac{1}{2}\,\xi^{a\,t} C^{-1}\xi^a}}{\sqrt{\det(2\pi C)}}\prod_{i=1}^{P} f(h_i^a) = \int d^P u^a\; e^{-2\pi^2\,u^{a\,t} A\,u^a}\prod_{i=1}^{P}\hat f(u_i^a). \tag{23}$$

The function $\hat f$ is the Fourier transform of f and must be understood in the distributional
sense, and $A = J C J^t$. This expression holds for any value of P and N (see Appendix B for
details). Then, the characteristic function $\hat p(\{u^a\}_{a=0,\dots,n})$ associated to the joint probability
distribution of the replicated local fields, $p(\{h^a\}_{a=0,\dots,n})$, is:

$$\hat p(\{u^a\}_{a=0,\dots,n}) = \frac{1}{\sqrt{\det\left(\mathrm{Id}_{NP\times NP} + 4\pi^2\,U\otimes\mathbf{M}\right)}}, \tag{24}$$

where the matrix U is the sum of all the projectors associated to each replica vector,

$$U_{ii'} = \sum_{a=0}^{n} u_i^a\,u_{i'}^a. \tag{25}$$

From eqs. (20) and (23) we obtain the following result:

$$\langle\langle p^{\,n+1}\rangle\rangle = \int\prod_{a=0}^{n} d^P u^a\;\hat p(\{u^a\})\prod_{a=0}^{n}\prod_{i=1}^{P}\hat f(u_i^a). \tag{26}$$

This formula is exact for the moments of integer order of p. This will be the starting point
of subsection 4.2, where we will calculate the moments of p exactly. Before that we present
the RSA solution in the next subsection.

4.1 The RSA approach 

After rather lengthy algebra we obtain the entropy term $I_1$. Some details of the
calculation are described in Appendix C; here we only give the final result. $I_1$ is a function
of two order parameters, here called x and s. Its value per input unit is given by:

if SA = aR(x) + \t[ In (Id^ + sG)]- , (27) 

where G is the normalized matrix, $G = N\,\mathbf{M}/\mathrm{Tr}(\mathbf{M})$. We have made use of the symbol $\tau$ for a
normalized trace operator, $\tau(\cdot) = \frac{1}{N}\mathrm{Tr}(\cdot)$. The order parameters satisfy the self-consistent
SP equations:

x = sr( ld 7V x7V+^ )/r( ldvx G v+^ ) , (28) 
-2c^ + l) 2 § 

The function R(x) is the average entropy of an effective transfer function g x defined as: 
g x (y) = -L I" dw e -(^) 2 /( J ^-w) . (29)



More precisely, 



R(x) = -= I dy e-y 2 S{g x {^y)) , (30) 



where S(z) is the entropy of a binary probability, i.e.,

$$S(z) = -z\ln z - (1-z)\ln(1-z). \tag{31}$$

It is worth noting that for the deterministic case $g_x(y) = \frac{1}{2}(1 + \mathrm{erf}(y))$, and for the fully
random case $g_x(y) = \frac{1}{2}$. In the deterministic case, there is a simple relation between our
parameters and those (q and g)used in Q: q = s and q = We prefer to use x instead 
of q because it usually yields simpler expressions. 

4.2 The Exact Solution 

In this subsection we present an exact evaluation of $i_1$ valid for $\alpha < \alpha_c$, where $\alpha_c$ is of
order one. It is necessary to assume that all the eigenvalues of $\mathbf{M} = \Gamma C$ are of the same
order. The details of the calculation are presented in the Appendices; we only give here the
final result for the moments:

$$\langle\langle p^{\,n+1}\rangle\rangle = 2^{-P(n+1)}\; e^{-\frac{1}{2}\mathrm{Tr}\left[\ln\left(\mathrm{Id}_{N\times N} - \frac{2}{\pi}k^2 n\,\alpha\,G\right)\right]}\; e^{-\frac{n}{2}\mathrm{Tr}\left[\ln\left(\mathrm{Id}_{N\times N} + \frac{2}{\pi}k^2\alpha\,G\right)\right]}, \tag{32}$$

where k is defined as:

$$k = \int_{-\infty}^{\infty} dy\; y\,e^{-\frac{y^2}{2}}\,f(\sqrt{M}\,y). \tag{33}$$

Let us now extrapolate eq. (32) to non-integer n. This gives our analytical estimate of $i_1$:

$$i_1^{an} = -\frac{1}{N}\lim_{n\to 0}\frac{2^{P}\langle\langle p^{\,n+1}\rangle\rangle - 1}{n} = \alpha\ln 2 - \frac{k^2}{\pi}\,\alpha + \frac{1}{2}\,\tau\!\left[\ln\left(\mathrm{Id}_{N\times N} + \frac{2k^2}{\pi}\,\alpha\,G\right)\right]. \tag{34}$$

The numerical constant k is real-valued, and it is expected to be of order one. In fact,
the maximum of k is reached in the deterministic case ($f(x) = \theta(x)$). This gives k = 1,
independent of $\mathbf{M}$. Notice that the minimum of k is reached by $f(x) = \theta(-x)$, giving
k = -1. The minimum of $k^2$ is realized by $f(x) = \frac{1}{2}$, which gives k = 0.
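The special values of k quoted above, and the resulting closed form of eq. (34), can be checked with a short numerical sketch. It assumes that the integral of eq. (33) runs over the whole real line and, for `i1_exact`, that $\mathbf{M}$ is proportional to the identity so that G = Id; the names `k_of` and `i1_exact` are illustrative.

```python
import math

def k_of(f, m_bar=1.0, ymax=10.0, steps=40000):
    """Trapezoidal estimate of k = int dy y exp(-y^2/2) f(sqrt(m_bar) y), eq. (33)."""
    dy = 2.0 * ymax / steps
    root = math.sqrt(m_bar)
    total = 0.0
    for i in range(steps + 1):
        y = -ymax + i * dy
        w = 0.5 if i in (0, steps) else 1.0
        total += w * y * math.exp(-0.5 * y * y) * f(root * y)
    return total * dy

def i1_exact(alpha, k):
    """Eq. (34) under the assumption G = Id (all eigenvalues of M equal)."""
    x = 2.0 * k * k * alpha / math.pi
    return alpha * math.log(2.0) - 0.5 * x + 0.5 * math.log1p(x)

k_det = k_of(lambda x: 1.0 if x > 0 else 0.0)   # deterministic channel, f = theta(x)
k_flip = k_of(lambda x: 1.0 if x < 0 else 0.0)  # f = theta(-x)
k_rand = k_of(lambda x: 0.5)                    # fully random channel
```

With `k_rand` equal to zero, `i1_exact` reduces to $\alpha\ln 2$, the value quoted for the completely random case.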
The computation done in this section (eq. (32)) is equivalent to a Taylor expansion of
the original equation for the moments, eq. (26). This can be checked by explicit evaluation
of the derivatives of the two expressions.²

5 Analysis of the RSA solution 

Using eq. (27), together with the SP equations (28), we obtain expansions of $i_1^{RSA}$
at small and large $\alpha$ and $\beta_0$ ($\beta_0 = \sqrt{2M}\,\beta$). We will make explicit calculations for the
deterministic and the HTT functions (the completely random case always gives $i_1 = \alpha\ln 2$).

5.1 Small $\alpha$ limit

Let us first investigate the deterministic case ($\beta_0 \to \infty$) in this regime. From eq. (28),
we can see that $s \simeq \frac{2}{\pi}\alpha$ and $x \simeq s\,\tau(G^2) \simeq \frac{2\tau(G^2)}{\pi}\alpha$. This gives the first two orders of the
expansion of $i_1$ in powers of $\alpha$:

$$i_1^{RSA} \underset{\alpha\ll 1}{\approx} \alpha\ln 2 - \alpha^2\,\frac{\tau(G^2)}{\pi^2}, \tag{35}$$

where we see that, as expected, the second order is negative. The next order in $T_0$ gives,
after solving the SP equations up to order $T_0$,

$$i_1^{RSA} \approx \alpha\ln 2 - \alpha^2\,\frac{\tau(G^2)}{\pi^2} + \alpha^2\,\frac{\tau(G^2)}{3}\,T_0^2. \tag{36}$$

This is a positive contribution. However, this does not mean that the MI increases with $T_0$;
in fact, the term $i_2$ gives a larger contribution of order $\alpha T_0$, as can be seen in eq. (16). More
precisely,

$$i^{RSA} = i_1^{RSA} - i_2 \approx \alpha\ln 2 - \frac{\pi^{3/2}}{6}\,\alpha T_0 - \alpha^2\,\frac{\tau(G^2)}{\pi^2} + \frac{\tau(G^2)}{3}\,\alpha^2 T_0^2. \tag{37}$$



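We now calculate the first order correction in $\beta_0 \ll 1$ to $i_1^{RSA}$. The leading order is the
fully stochastic case, and $i_1^{RSA} = i_2 = \alpha\ln 2$ (x = s = 0). Up to the next order, $i_1^{RSA}$ is:

$$i_1^{RSA} \approx \alpha\ln 2 - \alpha^2\,\frac{\tau(G^2)}{16}\,\beta_0^4. \tag{38}$$

From eqs. (17) and (38), we obtain:

$$i^{RSA} \approx \frac{\alpha}{4}\,\beta_0^2 - \frac{\tau(G^2)}{16}\,\alpha^2\beta_0^4. \tag{39}$$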



² To compute the derivatives of eq. (26) with respect to $\alpha$ one has first to make explicit its dependence
on the parameter N by expressing $\mathbf{M}$ in terms of G ($\mathbf{M} = \frac{\mathrm{Tr}(\mathbf{M})}{N}\,G$). Then, after setting each derivative at $\alpha = 0$,
the resulting integrals are easy to compute.
5.2 Large $\alpha$ limit

In this regime, and in the low temperature limit, we obtain from eq. (28) that $x \simeq s$
and $s \approx A_0\,\alpha\sqrt{x}$, where $A_0$ is the constant given in [1], [2]:

A = -^j^dz e- z2 S( 1+er 2 f ^ ), A « 0.72. From here we obtain s « ^ga 2 , x « Aga 2 . 
Substituting these parameters in eq. (27) one obtains the known result [1], [2]:

$$i_1^{RSA} \underset{\alpha\gg 1}{\approx} \ln\alpha + \frac{1}{2} + \ln A_0 + \frac{1}{2}\,\tau[\ln G]. \tag{40}$$

Adding weak output noise, and assuming a T$ — » we have: 

^ « lna + i + lnylo + ir[lng] + ^a 2 r 2 . (41) 
CI»i,t <ki 2 2 12 



From here and eq. (|16[): 

^ « In a + In A + I + \r[ lnfi? ] - ^aT + ^a 2 T 2 . (42) 

In the opposite limit, /3o <C 1 (large temperatures), and also assuming a/? 2 , small, it is 
straightforward to see that : 

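$$i_1^{RSA} \underset{\alpha\gg 1,\;T_0\gg 1}{\approx} \alpha\ln 2 - \frac{\beta_0^2}{4}\,\alpha + \frac{1}{2}\,\tau\!\left[\ln\left(\mathrm{Id}_{N\times N} + \frac{1}{2}\beta_0^2\,\alpha\,G\right)\right], \tag{43}$$

which together with eq. (17) gives:

$$i^{RSA} \approx \frac{1}{2}\,\tau\!\left[\ln\left(\mathrm{Id}_{N\times N} + \frac{1}{2}\beta_0^2\,\alpha\,G\right)\right], \tag{44}$$

which shows that the MI decays as $\beta_0^2$ when $\beta_0 \to 0$.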

5.3 Numerical analysis 

The plot of $\ln i^{RSA}$ (which is obtained by combining eqs. (13) and (27)) versus $\ln\alpha$, using the
HTT for several values of the reduced noise parameter $\beta_0 = \sqrt{2M}\,\beta$, is shown in Fig. 1. The
correlation matrix was taken proportional to the identity: $\mathbf{M} = \frac{\mathrm{Tr}(\mathbf{M})}{N}\,\mathrm{Id}_{N\times N}$. As expected,
for each $\alpha$, the MI decreases as the temperature increases. It is also interesting that an
increase in the temperature moves the saturation point (the change from the close-to-linear
regime to the asymptotic one, in which the MI increases more slowly with $\alpha$) to greater values of
$\alpha$.



6 Comparison between the exact and the RSA solutions 

The analytical result presented in eq. (34) seems rather astonishing as it provides a very
simple expression for $i_1$, compared with the cumbersome formulae of the RSA solution. Then
the following two questions arise: first, whether the two solutions coincide at
least in the range of validity of the exact one; second, whether the exact MI can be analytically
extended to greater values of $\alpha$. We will see that the answer to both questions is no, at least
for the deterministic transfer function.

Figure 1: $\ln i$ vs. $\ln\alpha$ computed with the RSA for the deterministic transfer function
and the HTT function, for several values of $\beta_0$:

1. Solid line: $\beta_0 = \infty$ (deterministic)

2. Dotted line: $\beta_0 = 10$ (near to deterministic)

3. Light dashed line: $\beta_0 = 1$

4. Dashed line: $\beta_0 = 0.1$ (not far from full stochasticity)

With respect to the first question, an expansion in powers of $\alpha$ can be easily evaluated
for the deterministic case. It turns out that the corresponding Taylor coefficients coincide up
to the second order, but the third is different. For instance, if the matrix $\mathbf{M}$ is proportional
to the identity we observe that $i^{an} - i^{RSA} = i_1^{an} - i_1^{RSA} \propto \alpha^3$ at the lowest order in
$\alpha$. It should be noted that $i^{an}$ is always greater than $i^{RSA}$ (see Figure 2). Both graphs are
very close up to an undetermined value of $\alpha$ near $\alpha = 1$ ($\ln\alpha \simeq 0$), from which they split
away fast. Detailed numerical studies for small $\alpha$ ($\alpha \in [0.0001, 0.005]$) confirmed a cubic
divergence between the two MIs, with a deviation from the predicted coefficient of less
than 0.25 %.

Figure 2: $\ln i$ vs. $\ln\alpha$ for the RSA solution (solid line) and for the analytical solution
(dashed line), for the deterministic transfer function.
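As a worked illustration of this comparison, the analytical expression of eq. (34) can be expanded explicitly. The sketch below assumes the deterministic channel (k = 1) and $\mathbf{M}$ proportional to the identity, so that G = Id and the normalized trace reduces to an ordinary logarithm; it is an elementary Taylor expansion rather than a formula quoted from the derivation.

```latex
\begin{aligned}
i_1^{an} &= \alpha\ln 2 - \frac{\alpha}{\pi}
            + \frac{1}{2}\ln\!\left(1 + \frac{2\alpha}{\pi}\right) \\
         &= \alpha\ln 2 - \frac{\alpha^2}{\pi^2}
            + \frac{4\,\alpha^3}{3\,\pi^3}
            - \frac{2\,\alpha^4}{\pi^4} + \dots ,
            \qquad \alpha < \frac{\pi}{2}.
\end{aligned}
```

The second-order coefficient reproduces eq. (35) with $\tau(G^2) = 1$, and the series only converges for $\alpha < \pi/2$, in line with a convergence radius of order one.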

As to the second question, the large $\alpha$ expansion of the RSA solution is (eq. (40)):

$$i_1^{RSA} \underset{\alpha\gg 1}{\approx} \ln\alpha + \frac{1}{2} + \ln A_0 + \frac{1}{2}\,\tau[\ln G].$$

It is consistent with what is known about continuous outputs, which should be
reproduced when $\alpha$ goes to infinity. On the other hand, the analytical solution gives, in this
limit:

$$i_1^{an} \underset{\alpha\gg 1}{\approx} \alpha\left(\ln 2 - \frac{1}{\pi}\right) + \frac{1}{2}\ln\alpha + \frac{1}{2}\ln\frac{2}{\pi} + \frac{1}{2}\,\tau[\ln G], \tag{45}$$

which is a qualitatively very different behaviour. Since the analytical solution is exact for
small $\alpha$, with a convergence radius O(1), the previous expansion suggests that the channel
exhibits a phase transition.
7 Discussion of the phase transition

In this section we give a series of arguments to support the existence of a phase transition
at $\alpha_c = O(1)$.



1. The first argument is provided by the behavior of the moments. The analytical
computation of the integer moments, eq. (32), is exact in the thermodynamic limit. Yet,
those moments cannot be correct for every value of $\alpha$. This is because they diverge at
the values $\alpha_n = \pi/(2k^2 n)$. On the other hand, since p is a positive variable bounded
by one (and then its moments should be less than one), one can conclude that eq. (26)
presents critical points before those values. A natural guess would be that these
singularities appear at values of $\alpha$ that follow the same behavior, $1/(k^2 n)$.³

2. The critical point of $\langle\langle p\ln p\rangle\rangle$ is related to the critical points of the moments (26).
We then expect that it has a phase transition at some $\alpha_c \sim 1/k^2$. As an example
we consider the completely random channel (k = 0), where according to the previous
argument the transition is pushed to infinity. In fact, the expansions of $i_1$ computed
with the analytical and the RSA solutions coincide in the large T limit, in both regimes
of $\alpha$ (eqs. (38) and (43)).



3. One could infer the existence of the critical point by observing the behaviour of the
probability density p(h). If one considers this distribution for only one replica, a
dramatic change in the shape of the function takes place when $\alpha$ goes from 1 to 2.
Considering its Fourier transform, eq. (53), it can be seen that $\hat p(u)$ behaves at infinity like
$u^{-N}$ (u denotes the modulus of the vector u), while the volume element behaves as $u^{P}$. This
means that this function is integrable (i.e., $\hat p(u)$ is an $L^1$ function⁴) up to $\alpha = 1$. It is
also a square integrable function (i.e., it belongs to $L^2$) in this range. From $\alpha = 1$ to
$\alpha = 2$ it is no longer an $L^1$ function, but it still belongs to $L^2$; and beyond $\alpha = 2$ it is
no longer in $L^2$. What does this mean in terms of p(h)?

• Below $\alpha = 1$, $\hat p(u) \in L^1$. Consequently its Fourier transform p(h) is bounded
(that is, it belongs to $L^\infty$). Besides, since p(h) is a probability density it is also
in $L^1$. The same argument holds for its derivatives in the thermodynamic limit.
This is because differentiation in h-space is equivalent to multiplication by powers of
u in u-space. Since the order of the derivatives is finite, the leading behaviour in
the thermodynamic limit is not changed.

It follows that p(h) and all its derivatives belong to $L^1 \cap L^\infty$. This means that
p(h) belongs to the Schwartz class (see for instance [21]). It is then a very
regular, fast decreasing function.



³ Remark: Since the moments factorize as the product of contributions related to each eigenvalue of $\mathbf{M}$,
we expect that there is a critical value of $\alpha$ for each of them. The functional behaviour at these transitions
is the same, differing only in the critical value of $\alpha$ where they occur. This is not a serious complication,
although one should keep in mind that the distribution of eigenvalues of $\mathbf{M}$ is relevant.

⁴ A function f belongs to $L^q$ if $\|f\|_q = \left[\int |f|^q\right]^{1/q} < +\infty$.
• Beyond $\alpha = 2$, $\hat p$ is no longer in $L^2$, so p cannot belong to $L^2$ either (since the
Fourier transform is an isomorphism on $L^2$). Thus, p cannot belong to $L^\infty$ (as p
belongs to $L^1$, it would then belong to $L^1 \cap L^\infty \subset L^2$, which is a contradiction).
Then the graph of p is broken by one or more divergences to $\infty$.

• Between $\alpha = 1$ and $\alpha = 2$, the transition between the other two regimes has
to occur.

For more than one replica, heuristic arguments suggest that the main
contributions to the characteristic function behave like $u^{-N}$, independent of the number
of replicas. The volume element behaves like $u^{P(n+1)}$. By the same arguments used in
the case of a single replica, $p(\{h^a\}_{a=0}^{n})$ now exhibits a transition which takes place
between $\alpha = 1/(n+1)$ and $\alpha = 2/(n+1)$. This is in agreement with the main conclusion
obtained in the first comment.

Thus, we have proved that the joint probability distribution of the replicated fields,
$p(\{h^a\}_{a=0}^{n})$, undergoes a phase transition at some finite $\alpha$. Recalling that
$\langle\langle p^{\,n+1}\rangle\rangle$ is calculated by averaging $\prod_{a=0}^{n}\prod_{i=1}^{P} f(h_i^a)$ with this function, it is thus
reasonable to think that the integer moments of p and the MI could exhibit a phase
transition caused by the transition in the distribution itself.

4. Another argument in favor of the existence of a transition is given by the behavior 
of the information capacity. It has been proved pi that this quantity has a third 
order transition for the deterministic channel. The high order of this transition makes 
the function rather smooth and the critical point hard to detect. The information 
capacity is only an upper bound of the MI, but it is plausible that the latter has a 
similar behavior. 
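The first argument of the list above can be made concrete with a small numerical sketch. Assuming G = Id, eq. (32) gives a closed form for $\frac{1}{N}\ln\langle\langle p^{n+1}\rangle\rangle$, which diverges at $\alpha_n = \pi/(2k^2 n)$ and already exceeds zero (that is, the moment exceeds one) before reaching that value. The function names are illustrative and the identity assumption on G is made only to keep the example short.

```python
import math

def alpha_divergence(n, k=1.0, g=1.0):
    """Load at which the factor 1 - (2/pi) k^2 n alpha g of eq. (32) vanishes."""
    return math.pi / (2.0 * k * k * n * g)

def log_moment_per_input(n, alpha, k=1.0):
    """(1/N) ln <<p^(n+1)>> from eq. (32), assuming G = Id."""
    x = 2.0 * k * k * alpha / math.pi
    if n * x >= 1.0:
        return math.inf  # the formula diverges at and beyond alpha_n
    return (-(n + 1) * alpha * math.log(2.0)
            - 0.5 * math.log(1.0 - n * x)
            - 0.5 * n * math.log(1.0 + x))

a1 = alpha_divergence(1)                    # pi / 2 for the deterministic channel
below = log_moment_per_input(1, 0.1)        # negative: the moment is below one
near = log_moment_per_input(1, 0.999 * a1)  # positive: the bound is already violated
```

A positive value of `near` means the moment formula breaks the bound $\langle\langle p^{n+1}\rangle\rangle \le 1$ slightly below $\alpha_1$, consistent with a critical point located before the divergence.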

These comments lead us to conclude that the MI undergoes a phase transition. What is
then the meaning of the RSA solution? The expansion in powers of $\alpha$ of the RSA solution
differs from that of the exact one at the third order, which is precisely the order of the
transition for the information capacity. Besides, a detailed study of the RSA solution shows
that the dependence of $i_1^{RSA}$ on $\alpha$ is infinitely smooth: this solution exhibits no change in
its behavior.

The conclusion is that the completely symmetric ansatz does not provide a wide enough
family of solutions and the maximal MI is not attained by this ansatz. This explains why
the exact solution is always above the RSA one. So, the RSA seems to be a smooth
regularization of the true MI. This would explain why it splits away from the true MI in a cubic
way, supposing that the latter possesses a third order transition. On the other hand, the
behaviour at large $\alpha$ of the RSA solution is consistent with that of the information capacity
and of the MI in a network with continuous output. Then, it is plausible that the RSA
provides a smoothing of the MI which asymptotically has the correct behavior, but which
completely masks the critical point.
8 Further steps: Beyond the RSA

We have explored the possibility of breaking the replica symmetry by modifying the
ansatz for U and V (see Appendix C). Our first attempt consisted in the usual RSB
ansatz. After rather lengthy calculations this led us to exactly the same solution given by
the RSA.

We also tried what can be called the Segregated Ansatz (SA), in which the first of the
replicas is split from the other n. Then we assume:

$$\begin{aligned}
U_{00} &= U_0, \\
U_{0b} = U_{b0} &= U_1, \quad b = 1,\dots,n \\
U_{aa} &= U_2, \quad a = 1,\dots,n \\
U_{ab} &= U_3, \quad \forall\, a \neq b \in \{1,\dots,n\}
\end{aligned} \tag{46}$$

and analogously for V.⁵

Under this ansatz the RS solution again satisfies the SP equations. But in addition, a
new infinite set of functions of $\alpha$ appeared that also satisfy the SP equations. The MI will
then be given, at each $\alpha$, by the function providing the maximal MI. We observed that at
large $\alpha$ this infinity of solutions contributes below the RSA. However, for arbitrary values
of $\alpha$ the problem is too complicated to deal with.



9 Conclusions 

In this paper we investigated the information processing by a noisy perceptron channel. Our
network has N real-valued input and P binary output neurons, whose state is determined by
the joint probability distribution of the input and output states $P(v, \xi)$. We performed the
calculation for a general continuous and bounded transfer function, depending on a noise
parameter. Our study generalizes previous results obtained for deterministic channels [1]
using the replica technique. We also give the explicit expressions for the mutual information
at different asymptotic regimes of the load parameter $\alpha = P/N$ and the noise $\beta$.

The mutual information per input unit can be decomposed in two pieces: $i = i_1 - i_2$.
The second term, which extracts the wrong bits of information (the equivocation), can be
calculated exactly because of the factorization of the probability. The entropic part $i_1$ is
more difficult to compute. Here we computed it by means of the replica technique and
analytical methods.

Our main result is that for values of $\alpha$ up to some value O(1) there exists an exact
solution for i, which we found explicitly (eq. (34)). This solution is different from the replica
symmetry ansatz solution (eqs. (27) to (31)). A numerical computation of both solutions
gives the remarkable result that they are extremely close to each other up to $\alpha \sim 1$ (Fig. 2).
A small $\alpha$ expansion shows that the two solutions are equal up to the second order. Although
the corresponding Taylor expansions differ from the third order onwards, the numerical agreement
up to $\alpha \sim 1$ is excellent (a relative difference of less than 0.9% up to $\alpha = 0.1$). This is due
to intriguing cancellations between higher orders.

⁵ This ansatz is justified because it splits a typical n x n box from the matrices, which are (n + 1) x (n + 1).
This splitting allows the segregated replica to behave independently from the others.

Our conclusion is that there exists a critical value $\alpha_c$ of order one, above which a drastic
change of the mutual information occurs. This signals the appearance of a phase transition.
Above $\alpha_c$ the analytical solution is not valid; one of the reasons is that it does not have
the correct large $\alpha$ behaviour (it violates a bound given by the information capacity). On
the other hand, even if the replica symmetric solution is wrong at small $\alpha$, it does have
the correct asymptotic behaviour. Our interpretation of the RS solution is that it should
be considered as a smooth regularization of the true mutual information, which is given by
the analytical solution, eq. (34), for $\alpha < \alpha_c$. The precise value of $\alpha_c$ cannot be determined
by our techniques. There is numerical evidence [22] supporting the validity of the analytical
solution and the conjecture that the RSA solution is an excellent interpolation between the
small and large $\alpha$ behaviors. The analysis of the origin of the discrepancies between the
RSA and the analytical approaches will be the subject of future work.

We have also explored some other schemes beyond the completely symmetric ansatz.
These are based on different types of replica symmetry breaking ansatz, such as the usual
breaking of the symmetry [3] and the separation of the first replica from the others. In
the first case it was shown that the new solution coincides with the symmetric one. In the
second, and because of the complexity of the problem, we have not been able to give an
explicit final result.



Note added in proof 

Part of this work was presented at the "Interdisciplinary Workshop on Neural Networks",
Würzburg, Germany (October '95) and at the "Física Estadística '96" meeting, Zaragoza,
Spain (May '96).
Appendices



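A Calculation of $I_2$

To compute $I_2$ in the general case, we use again the fact that $\xi$ and $h$ are deterministically
related, which leads to

$$ I_2 = \sum_{i=1}^{P} \int d\mathbf{J}\, P(\mathbf{J}) \int d^P h\; p(h|\mathbf{J})\, \mathcal{H}(h_i) \tag{47} $$

or, in terms of the Fourier transforms of the field distribution $p(h|\mathbf{J})$ and of $\mathcal{H}(h_i)$
($\hat{p}(u|\mathbf{J})$ and $\hat{\mathcal{H}}(u_i)$, respectively)

$$ I_2 = \sum_{i=1}^{P} \int d\mathbf{J}\, P(\mathbf{J}) \int d^P u\; \hat{p}(u|\mathbf{J})\, \hat{\mathcal{H}}(u_i) . \tag{48} $$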

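The Fourier transform of the field distribution is computed in Appendix B. One has

$$ \hat{p}(u|\mathbf{J}) = e^{-2\pi^2\, u \cdot \mathbf{A} u} \tag{49} $$

where $\mathbf{A} = \mathbf{J} \mathbf{C} \mathbf{J}^t \in M_{P\times P}(\Re)$. After integrating over $\mathbf{J}$ in eq. (48), we obtain:

$$ I_2 = \sum_{i=1}^{P} \int d^P u\; \hat{p}(u)\, \hat{\mathcal{H}}(u_i) , \tag{50} $$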

where $\hat{p}(u)$ is the characteristic function of $p(h)$. Although we cannot calculate the
probability density of $h$, we can have an explicit expression for its Fourier transform by
comparing eqs. (48) and (50). Replacing eq. (49) in eq. (48) we obtain

$$ \hat{p}(u) = \frac{1}{\sqrt{\det\left( \mathrm{Id}_{NP\times NP} + 4\pi^2\, \mathbf{U} \otimes \mathbf{M} \right)}} . \tag{51} $$



Here,

• $\mathrm{Id}_{NP\times NP}$ is the identity matrix in $NP$ dimensions.

• "$\otimes$" stands for the tensor product between $\mathbf{U}$ and $\mathbf{M}$.

• $\mathbf{M} = \mathbf{\Gamma} \mathbf{C}$ is a constant, $N$ dimensional matrix. Since $h$ is of order one, Tr($\mathbf{M}$) is also
of order one (Tr stands for the trace).

• $\mathbf{U}$ is a $P$ dimensional matrix defined as the projector on $u$: $(\mathbf{U})_{ii'} = u_i u_{i'}$.

Since $\hat{p}(u)$ is invariant under similarity transformations of $\mathbf{M}$, it can be expressed as

$$ \hat{p}(u) = \frac{1}{\prod_{j=1}^{N} \sqrt{\det\left( \mathrm{Id}_{P\times P} + 4\pi^2 m_j \mathbf{U} \right)}} \tag{52} $$

where $\{m_j\}$, $j = 1, \ldots, N$, is the set of eigenvalues of $\mathbf{M}$. The matrix $\mathbf{U}$ has only one non-zero
eigenvalue, which is $|u|^2 = u \cdot u$, and thus:

$$ \hat{p}(u) = e^{-\frac{1}{2} \sum_{j=1}^{N} \ln\left( 1 + 4\pi^2 m_j |u|^2 \right)} . \tag{53} $$
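The chain of equalities from eq. (51) to eq. (53) can be checked numerically. The following is a minimal sketch, not part of the original derivation: it assumes a symmetric positive matrix for M (the text only requires invariance under similarity transformations), and the sizes `N`, `P` and the random test data are hypothetical.

```python
import numpy as np

def log_p_hat_three_ways(u, M):
    """Return ln p_hat(u) evaluated from eqs. (51), (52) and (53)."""
    P, N = u.size, M.shape[0]
    c = 4.0 * np.pi**2
    U = np.outer(u, u)                 # rank-one matrix (U)_{ii'} = u_i u_{i'}
    m = np.linalg.eigvalsh(M)          # eigenvalues m_j of the symmetric test matrix

    # eq. (51): a single NP x NP determinant of Id + 4 pi^2 (U tensor M)
    logdet51 = np.linalg.slogdet(np.eye(N * P) + c * np.kron(U, M))[1]
    # eq. (52): product over j of P x P determinants
    logdet52 = sum(np.linalg.slogdet(np.eye(P) + c * mj * U)[1] for mj in m)
    # eq. (53): closed form using the only non-zero eigenvalue |u|^2 of U
    logdet53 = np.sum(np.log1p(c * m * (u @ u)))
    return -0.5 * logdet51, -0.5 * logdet52, -0.5 * logdet53

rng = np.random.default_rng(0)
N, P = 4, 3                            # hypothetical small sizes
B = rng.normal(size=(N, N))
M = B @ B.T / N**2                     # assumed symmetric, positive test matrix
u = rng.normal(size=P)
print(log_p_hat_three_ways(u, M))      # the three values should coincide
```

The three numbers agree because the eigenvalues of a tensor product are the products of the eigenvalues of its factors, and U has the single non-zero eigenvalue |u|^2.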

The computation of $I_2$ does not need the whole joint distribution $p(h)$ but only the
marginals $p(h_i)$, $i = 1, \ldots, P$. By permutation symmetry, all of them are
given by the same function. Let us compute for example $p(h_1)$. Its Fourier transform is

$$ \hat{p}(u_1) = \hat{p}(u_1, \underbrace{0, \ldots, 0}_{P-1}) = e^{-\frac{1}{2} \sum_{j=1}^{N} \ln\left( 1 + 4\pi^2 m_j u_1^2 \right)} \tag{54} $$

and since all the marginals are the same function, all the terms in $I_2$ are the same. Then $I_2$
reads

$$ I_2 = P \int_{-\infty}^{\infty} dh\; p(h)\, \mathcal{H}(h) . \tag{55} $$

So far there is no hypothesis upon the matrix $\mathbf{M}$. Particularly interesting is the case in
which all the $m_j$'s are of the same order, namely, of order $1/N$ (as we have already said,
Tr($\mathbf{M}$) is $O(1)$). In this particular case,

$$ \hat{p}(u) \underset{N \gg 1}{\approx} e^{-\frac{1}{2} \sum_{j=1}^{N} 4\pi^2 m_j u^2} + O(e^{-N}) = e^{-2\pi^2 M u^2} + O(e^{-N}) , \tag{56} $$

where $M = \mathrm{Tr}(\mathbf{M})$. In the thermodynamic limit the term $O(e^{-N})$ becomes negligible and
$p(h)$ is:

$$ \lim_{N \to \infty} p(h) = \frac{e^{-h^2/(2M)}}{\sqrt{2\pi M}} . \tag{57} $$

(Note that this expression makes explicit the reason why $M = O(1)$.) The conditional
output entropy now is:

$$ I_2 = P \int_{-\infty}^{\infty} dz\, \frac{e^{-z^2}}{\sqrt{\pi}}\, \mathcal{H}\left( \sqrt{2M}\, z \right) . \tag{58} $$
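The Gaussian integral in eq. (58) is well suited to Gauss-Hermite quadrature, whose weight function is exactly e^{-z^2}. The sketch below is only an illustration under stated assumptions: it takes $\mathcal{H}$ to be the binary entropy (in nats) of the output given the field, and it uses a hyperbolic-tangent transfer function f(h) = (1 + tanh(βh))/2 as a stand-in for the stochastic unit. The function names, `beta` and the number of nodes are hypothetical.

```python
import numpy as np

def binary_entropy(f):
    """Entropy (in nats) of a binary variable equal to 1 with probability f."""
    f = np.clip(f, 1e-300, 1.0)
    g = np.clip(1.0 - f, 1e-300, 1.0)      # avoids log(0) when f saturates
    return -(f * np.log(f) + g * np.log(g))

def i2_per_output_unit(M, beta, n_nodes=80):
    """Gauss-Hermite estimate of I_2 / P as written in eq. (58)."""
    z, w = np.polynomial.hermite.hermgauss(n_nodes)   # nodes and weights for exp(-z^2)
    h = np.sqrt(2.0 * M) * z                          # field argument sqrt(2M) z
    f = 0.5 * (1.0 + np.tanh(beta * h))               # assumed transfer function
    return np.sum(w * binary_entropy(f)) / np.sqrt(np.pi)

for beta in (0.0, 0.1, 1.0, 10.0):
    print(beta, i2_per_output_unit(M=1.0, beta=beta))
```

For β = 0 the output is pure noise and the estimate reduces to ln 2 per unit; larger β lowers the conditional entropy. For very sharp transfer functions a plain Gauss-Hermite rule converges slowly, so a finer rule or a change of variables would be preferable there.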
B Computation of $\hat{p}(u|\mathbf{J})$ for P > N

We define the Fourier transform of a function $F(h)$ as the function $\hat{F}(u)$ given by:

$$ \hat{F}(u) = \int d^P h\; F(h)\, e^{-2\pi i\, u \cdot h} . \tag{59} $$

The evaluation of $\hat{p}(u|\mathbf{J})$ in the case P < N is simple. This is because for almost every
$\mathbf{J}$ the random vector $h$ follows a Gaussian distribution with correlation matrix $\mathbf{A} = \mathbf{J} \mathbf{C} \mathbf{J}^t$
and $\det(\mathbf{A}) \neq 0$. Then,

$$ \hat{p}(u|\mathbf{J}) = e^{-2\pi^2\, u \cdot \mathbf{A} u} . \tag{60} $$

We now prove that this equation is still true when P > N. Let us first notice that in 
this case det(A) is necessarily null, and consequently the random vector h is not Gaussian. 

Let us compute $p(v|\mathbf{J})$ for the particular vector $v = (1, 1, \ldots, 1)$. Denoting this as $p$ we
have:

$$ p = \int d^N \xi\; \frac{e^{-\frac{1}{2} \xi \cdot \mathbf{C}^{-1} \xi}}{\sqrt{\det(2\pi \mathbf{C})}} \prod_{i=1}^{P} f(h_i) = \int d^P u\; \hat{p}(u|\mathbf{J}) \prod_{i=1}^{P} \hat{f}(u_i) \tag{61} $$

with $h_i = \sum_{j=1}^{N} J_{ij} \xi_j$ and $u_i$ being its conjugate Fourier variable. For P > N, the first N
components of $h$ are independent random variables and the other P − N depend upon the
former (for almost every $\mathbf{J}$).

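We split the matrix $\mathbf{J}$ into two matrices: $\mathbf{K} \in M_{N\times N}(\Re)$, $K_{jj'} = J_{jj'}$; and
$\mathbf{L} \in M_{(P-N)\times N}(\Re)$, $L_{kj} = J_{N+k,j}$, $k = 1, \ldots, P-N$; $j = 1, \ldots, N$:

$$ \mathbf{J} = \begin{pmatrix} \mathbf{K} \\ \mathbf{L} \end{pmatrix} $$

and for almost every $\mathbf{J}$, $\mathbf{K}$ is invertible. Then, we split $h = (h^0, h^1)$, $h^0 \in \Re^N$ and
$h^1 \in \Re^{P-N}$. Moreover, $h^0$ is Gaussian distributed with zero mean and covariance matrix
$\mathbf{A}_0 = \mathbf{K} \mathbf{C} \mathbf{K}^t$, and $h^1 = (\mathbf{L} \mathbf{K}^{-1})\, h^0$.

In this way, we obtain $\prod_{i=1}^{P} f(h_i) = \prod_{j=1}^{N} f(h^0_j) \prod_{k=1}^{P-N} f([(\mathbf{L} \mathbf{K}^{-1})\, h^0]_k)$, and hence $p$
can be written as:

$$ p = \int d^N h^0\; \frac{e^{-\frac{1}{2} h^0 \cdot \mathbf{A}_0^{-1} h^0}}{\sqrt{\det(2\pi \mathbf{A}_0)}} \prod_{j=1}^{N} f(h^0_j) \prod_{k=1}^{P-N} f\left( [(\mathbf{L} \mathbf{K}^{-1})\, h^0]_k \right) . \tag{62} $$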
If $g(h)$ is a function of vectorial argument, and $f(x)$ has real argument, we have that:

$$ \widehat{\left[ g(h)\, f(a \cdot h) \right]}(u) = \int_{-\infty}^{\infty} dc\; \hat{g}(u - c\, a)\, \hat{f}(c) , \tag{63} $$

where the hat symbol stands for the Fourier transform and $a$ is an arbitrary constant
vector. It should be noted that $\hat{g}$ is a multidimensional Fourier transform while $\hat{f}$ is the
one-dimensional Fourier transform.

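Let us denote by $d_k$ the (P − N) N-dimensional vectors defined by the rows of $\mathbf{L} \mathbf{K}^{-1}$.
Applying the previous formula to the expression for $p$, and after using the Bessel-Plancherel
identity, we obtain:

$$ p = \int d^N u'\; e^{-2\pi^2\, u' \cdot \mathbf{A}_0 u'} \int d^{P-N} c\; \prod_{j=1}^{N} \hat{f}\Big( \big( u' - \sum_{k=1}^{P-N} c_k d_k \big)_j \Big) \prod_{k=1}^{P-N} \hat{f}(c_k) . \tag{64} $$

Interchanging now the order of integrations and performing the change of variables to $u^0$,
related via $u' = u^0 + \sum_{k=1}^{P-N} c_k d_k$, we obtain:

$$ p = \int d^{P-N} c \int d^N u^0\; \prod_{j=1}^{N} \hat{f}(u^0_j) \prod_{k=1}^{P-N} \hat{f}(c_k)\; e^{-2\pi^2 \left( u^0 \cdot \mathbf{A}_0 u^0 + \sum_{k,k'=1}^{P-N} c_k c_{k'}\, d_k \cdot \mathbf{A}_0 d_{k'} + 2 \sum_{k=1}^{P-N} c_k\, u^0 \cdot \mathbf{A}_0 d_k \right)} . \tag{65} $$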



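It is convenient to combine $u^0$ and $c$ in a single P dimensional vector $u = (u^0, c)$.
Expressing eq. (65) in terms of this vector, we can use the vectors $d_k$ to simplify the bilinear
expression in the exponent as:

$$ u^0 \cdot \mathbf{A}_0 u^0 + \sum_{k,k'=1}^{P-N} c_k c_{k'}\, d_k \cdot \mathbf{A}_0 d_{k'} + 2 \sum_{k=1}^{P-N} c_k\, u^0 \cdot \mathbf{A}_0 d_k = u \cdot \mathbf{A} u , \qquad \mathbf{A} = \mathbf{J} \mathbf{C} \mathbf{J}^t , \tag{66} $$

that depends only on $\mathbf{A}$. From the right hand side of eq. (61) we have

$$ \hat{p}(u|\mathbf{J}) = e^{-2\pi^2\, u \cdot \mathbf{A} u} . \tag{67} $$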
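The statement that the Gaussian form of the characteristic function survives for P > N, where A is singular, can be illustrated by a small Monte Carlo experiment. This is a sketch under assumptions, not the proof given above: the sizes, the identity input covariance and the scale of the test vector `u` are hypothetical choices.

```python
import numpy as np

def characteristic_function_check(J, C, u, n_samples=200_000, seed=0):
    """Compare a sampled E[exp(-2 pi i u.h)] with exp(-2 pi^2 u.A u), A = J C J^t."""
    rng = np.random.default_rng(seed)
    N = C.shape[0]
    xi = rng.multivariate_normal(np.zeros(N), C, size=n_samples)  # Gaussian inputs
    h = xi @ J.T                                                  # local fields h = J xi
    mc = np.mean(np.exp(-2j * np.pi * (h @ u)))                   # sampled characteristic function
    A = J @ C @ J.T
    exact = np.exp(-2.0 * np.pi**2 * (u @ A @ u))
    return mc, exact, np.linalg.matrix_rank(A)

rng = np.random.default_rng(1)
N, P = 3, 5                                  # P > N, so A (P x P) has rank N < P
J = rng.normal(size=(P, N)) / np.sqrt(N)     # hypothetical couplings
C = np.eye(N)                                # assumed input covariance
u = 0.1 * rng.normal(size=P)
mc, exact, rank = characteristic_function_check(J, C, u)
print(mc, exact, rank)
```

The sampled value should match the closed form up to a statistical error of order 1/sqrt(n_samples), even though det(A) = 0.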



C RSA derivation for $I_1$

We now derive the Saddle Point (SP) equations. These are then simplified by using the
Replica Symmetry Ansatz (RSA) |J. The first order parameters are the overlap of two
replicated Fourier transforms of the local field:

$$ U_{ab} = \frac{1}{P}\, u^a \cdot u^b , \qquad a, b = 0, \ldots, n \tag{68} $$

The pre-factor is taken in order to ensure that it is of order one. These are the elements
of a matrix $\mathbf{U} \in M_{(n+1)\times(n+1)}(\Re)$. Then, the Fourier transform of the joint distribution of
the replicated local fields, eq. (|2^), can be expressed in terms of this matrix as: 



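$$ \hat{p}(\{u^a\}) = \prod_{a \le b} \int_{-\infty}^{\infty} dU_{ab}\; \delta\Big( U_{ab} - \frac{1}{P}\, u^a \cdot u^b \Big)\; e^{-\frac{1}{2} \sum_{j=1}^{N} \ln \det\left[ \mathrm{Id}_{(n+1)\times(n+1)} + 4\pi^2 P m_j \mathbf{U} \right]} \tag{69} $$

Now we introduce an order parameter $\mathbf{V}$, conjugated to $\mathbf{U}$. To linearize the quadratic form
in the $u^a$'s we will use P new variables $w_i$, which are (n + 1)-dimensional vectors.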

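After substituting eq. (69) in eq. (26), we can perform the integrals over the $u^a$. Since
these integrals are the anti-Fourier transforms of the $\hat{f}(u_i^a)$, the $\ll p^{n+1} \gg$ can be expressed
in terms of the product of the transfer functions simply:

$$ \ll p^{n+1} \gg = \int \prod_{a \le b} \left( -iP\, dV_{ab}\, dU_{ab} \right) e^{2\pi P \sum_{a \le b} U_{ab} V_{ab} - \frac{1}{2} \sum_{j=1}^{N} \ln \det\left[ \mathrm{Id}_{(n+1)\times(n+1)} + 4\pi^2 P m_j \mathbf{U} \right]} \prod_{i=1}^{P} \int d^{n+1} w_i\; \frac{e^{-\frac{1}{2} w_i \cdot \mathbf{V}^{-1} w_i}}{\sqrt{\det(2\pi \mathbf{V})}} \prod_{a=0}^{n} f\left( \frac{w_i^a}{\sqrt{\pi}} \right) \tag{70} $$

($i = \sqrt{-1}$). This can be written as:

$$ \ll p^{n+1} \gg = \int \prod_{a \le b} \left( -iP\, dV_{ab}\, dU_{ab} \right) e^{2\pi P \sum_{a \le b} U_{ab} V_{ab} - \frac{1}{2} \sum_{j=1}^{N} \ln \det\left[ \mathrm{Id}_{(n+1)\times(n+1)} + 4\pi^2 P m_j \mathbf{U} \right] + P \ln Z(\mathbf{V})} \tag{71} $$

where

$$ Z(\mathbf{V}) = \int d^{n+1} w\; \frac{e^{-\frac{1}{2} w \cdot \mathbf{V}^{-1} w}}{\sqrt{\det(2\pi \mathbf{V})}} \prod_{a=0}^{n} f\left( \frac{w^a}{\sqrt{\pi}} \right) . \tag{72} $$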


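In the large N limit ($\alpha = P/N$ fixed), the integrals over $\mathbf{U}$ and $\mathbf{V}$ in eq. (71) can be solved
by the SP method. This gives:

$$ \ll p^{n+1} \gg \approx e^{G(\mathbf{U}_0, \mathbf{V}_0)} \tag{73} $$

where $\mathbf{U}_0$ and $\mathbf{V}_0$ are the SP values and

$$ G = 2\pi P \sum_{a \le b} U_{ab} V_{ab} - \frac{1}{2} \sum_{j=1}^{N} \ln \det\left[ \mathrm{Id}_{(n+1)\times(n+1)} + 4\pi^2 P m_j \mathbf{U} \right] + P \ln Z(\mathbf{V}) . \tag{74} $$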
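The RSA is:

$$ U^0_{aa} = u_0 \;\; \forall a , \quad U^0_{ab} = u_1 \;\; \forall a \neq b ; \qquad V^0_{aa} = v_0 \;\; \forall a , \quad V^0_{ab} = v_1 \;\; \forall a \neq b . \tag{75} $$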
The starting point is eq. (73), where the function G is evaluated with the RSA given
in eq. (75). Defining the matrix $\mathbf{1} \in M_{(n+1)\times(n+1)}(\Re)$ as $(\mathbf{1})_{ab} = 1 \;\forall a, b$, $\mathbf{V}$ can be
expressed as:

$$ \mathbf{V} = \left( v_0 - \tfrac{1}{2} v_1 \right) \mathrm{Id}_{(n+1)\times(n+1)} + \tfrac{1}{2} v_1 \mathbf{1} \tag{76} $$

and its inverse is:

$$ \mathbf{V}^{-1} = x_0\, \mathrm{Id}_{(n+1)\times(n+1)} + x_1 \mathbf{1} \tag{77} $$

where $x_0 = \frac{1}{v_0 - \frac{1}{2} v_1}$ and $x_1 = -\frac{\frac{1}{2} v_1}{\left( v_0 - \frac{1}{2} v_1 \right)\left( v_0 + \frac{n}{2} v_1 \right)}$. This form of $\mathbf{V}^{-1}$ allows us to express
$Z(\mathbf{V})$ (eq. (72)) in a more convenient way:



Z(V) = (2tt)- 



.2±i § 

2 Xn 



x + (n + l)xi 



„ n 
J a=0 



(78) 



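Notice that $Z(\mathbf{V}) \to \frac{1}{2}$ as $n \to 0$. We now expand $Z(\mathbf{V})$ up to order n, $Z \approx \frac{1}{2} + n\, h(x_0, x_1)$,
$h = \partial Z / \partial n |_{n=0}$. Up to this order we obtain:

$$ G \approx 2\pi P (n+1)\, v_0 u_0 + \pi P n\, v_1 u_1 - \frac{1}{2} \sum_{j=1}^{N} \ln\left( 1 + 4\pi^2 P m_j u_0 \right) - \frac{n}{2} \sum_{j=1}^{N} \frac{4\pi^2 P m_j u_1}{1 + 4\pi^2 P m_j u_0} - \frac{n}{2} \sum_{j=1}^{N} \ln\left( 1 + 4\pi^2 P m_j (u_0 - u_1) \right) - P \ln 2 + 2Pn\, h(x_0, x_1) . \tag{79} $$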
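The replica-symmetric algebra behind eqs. (76), (77) and the determinant terms of eq. (79) can be verified for an integer number of replicas. The sketch below is illustrative only: the numerical values of `v0`, `v1`, `u0`, `u1` and `c` (which stands for 4π²P m_j) are hypothetical, and the factors 1/2 in V follow the reading of eq. (76) adopted here, in which the off-diagonal elements of V are v_1/2.

```python
import numpy as np

def rs_matrix(diag, off, size):
    """Replica-symmetric matrix: `diag` on the diagonal, `off` everywhere else."""
    return (diag - off) * np.eye(size) + off * np.ones((size, size))

n = 3                       # integer n, so matrices are (n + 1) x (n + 1)
v0, v1 = 1.3, 0.4           # hypothetical order parameters
u0, u1 = 0.7, 0.2
c = 0.9                     # stands for 4 pi^2 P m_j

# V with v0 on the diagonal and v1/2 off the diagonal, and its claimed inverse
V = rs_matrix(v0, 0.5 * v1, n + 1)
x0 = 1.0 / (v0 - 0.5 * v1)
x1 = -0.5 * v1 / ((v0 - 0.5 * v1) * (v0 + 0.5 * n * v1))
V_inv = x0 * np.eye(n + 1) + x1 * np.ones((n + 1, n + 1))

# ln det[Id + c U] for replica-symmetric U: n degenerate eigenvalues plus one shifted
U = rs_matrix(u0, u1, n + 1)
logdet_numeric = np.linalg.slogdet(np.eye(n + 1) + c * U)[1]
logdet_closed = n * np.log(1 + c * (u0 - u1)) + np.log(1 + c * u0 + n * c * u1)

print(np.allclose(V @ V_inv, np.eye(n + 1)), logdet_numeric, logdet_closed)
```

Expanding the closed-form determinant to first order in n gives the three sums over j that appear in eq. (79).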

The SP equations extremize G with respect to its variables. From the SP equation
$\partial G / \partial v_0 = 0$ one obtains that $u_0$ is linear in n: $u_0 = n u$. Replacing eq. (79) in eq. (73), and
this in eq. (22) we have:



jRSA = _ 27T p VQ u + ttP^tj + 2ir 2 PMu - 2ir 2 PMu + - ^ In (1 + A^Prriju) - 2Ph(x , x x ) 

(80) 



3=1 



where u = — u\. The SP equations are : 

v = 



ttM 



u = -^dh/dvo 
u = -dh/dv\ 



(81) 






$u$ is a Lagrange multiplier that can be easily removed by substituting the value of $v_0$ in
$I_1^{RSA}$.

The evaluation of $h(x_0, x_1)$ requires some care. We express Z as:



Z(V) = (2vr)-— x 2 Jx + (n + l)x x 



X3 p 2 *^ 

oo \/2vr 



n+l 



where 



/OO 12., 
-OO 

Computing the term of order n of Z we obtain: 



-ln2-ij^±^M(x ,x 1 ; 
2 2 1/ 2ttx 



with 



M(x ,xi) = 

where the function F(y) is defined by: 



2 _ 

dz e 2 -o F(^(z)) , 



F(y) = iln^-iln(l- 2/ 2 ) 



and its argument is : 



, 2x 

9(z) = \l 



2 1 + 2 



1 2 

dw e~z x ° w sinh(V-2;i zw)f(w/y/ir) . 



Substituting $h(x_0, x_1)$ given above in eq. (80), we obtain:



(82) 



(83) 



(84) 



(85) 



(86) 



(87) 



^ ^ m 2 + P^gf M(x , x x) + 1 1 ln(l + 4^.) - 1 1 T^S^ 



and the SP equations become: 
Xq = 

x\ = 



1 

ttM 

.i a 



Xq 



We can substitute in I^- SA one of the parameters, for example xq . Defining x = —irMxi, 
s = 4tt 2 M olu and rearranging conveniently eqs. (|88| ) and (|89|), we finally obtain eqs. ( P7|) 
and (l2l). 
D Analytical derivation of $I_1$

This exact calculation starts from eq. (|2q). We assume that all the eigenvalues $\{m_j\}_{j=1,\ldots,N}$
of the matrix $\mathbf{M}$ are at most of order $1/N$. We define now $\mathbf{U}$ in a slightly different way from
eq. (68):

$$ U_{ab} = u^a \cdot u^b . \tag{90} $$

These elements are of order P and hence $m_j \mathbf{U}$ is of order α. $\hat{p}(\{u^a\})$ is computed as in
subsection 4.1. Then, the logarithm in the exponent of eq. (69) can be expanded around
the identity matrix (which can be done if α is less than - * times a geometrical factor
of order one, that depends on $\mathbf{G}$):

$$ \hat{p}(\{u^a\}) = \prod_{m=1}^{\infty} e^{\frac{(-4\pi^2)^m}{2m}\, \mathrm{Tr}(\mathbf{M}^m)\, \mathrm{Tr}(\mathbf{U}^m)} \tag{91} $$
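The expansion in eq. (91) is the Taylor series of the logarithm of the determinant, so it can be compared with the closed form whenever the series converges. The following sketch assumes eigenvalues m_j of order 1/N and small replicated vectors so that 4π² m_j times the largest eigenvalue of U stays below one; all sizes, scales and names are hypothetical.

```python
import numpy as np

def log_p_hat_exact(m, U):
    """-1/2 sum_j ln det[Id + 4 pi^2 m_j U], with U_ab = u^a . u^b as in eq. (90)."""
    c = 4.0 * np.pi**2
    size = U.shape[0]
    return -0.5 * sum(np.linalg.slogdet(np.eye(size) + c * mj * U)[1] for mj in m)

def log_p_hat_series(m, U, m_max=60):
    """Truncated exponent of eq. (91): sum_k (-4 pi^2)^k / (2k) Tr(M^k) Tr(U^k)."""
    total = 0.0
    U_pow = np.eye(U.shape[0])
    for k in range(1, m_max + 1):
        U_pow = U_pow @ U                           # U^k
        tr_M_k = np.sum(m**k)                       # Tr(M^k) from the eigenvalues
        total += (-4.0 * np.pi**2) ** k / (2 * k) * tr_M_k * np.trace(U_pow)
    return total

rng = np.random.default_rng(0)
N, P, n = 50, 5, 2
m = rng.uniform(0.5, 1.5, size=N) / N               # eigenvalues of order 1/N
u = 0.1 * rng.normal(size=(n + 1, P))               # small replicated vectors u^a
U = u @ u.T                                         # (n + 1) x (n + 1) overlap matrix
print(log_p_hat_exact(m, U), log_p_hat_series(m, U))
```

Within the convergence radius the two values agree to machine precision, which is the sense in which the expansion is exact rather than approximate.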

Let us remark that this is not an approximation. It is an exact derivation valid in an
(undetermined) range of values of α of order one. We can alternatively write this in the
following form:

$$ \hat{p}(\{u^a\}) = e^{-2\pi^2 M \sum_{a=0}^{n} |u^a|^2} \prod_{m=2}^{\infty} e^{\frac{(-4\pi^2 M)^m}{2m}\, N\, \mathrm{Tr}(\mathbf{G}^m)\, \mathrm{Tr}\left[ (\mathbf{U}/N)^m \right]} . \tag{92} $$

The second factor can now be expanded, leading to polynomials in traces of powers of $\mathbf{U}/N$.

Given a function $F(\{u^a\})$, we now define its average with the transfer function (or
shortly, its transfer-average) as:

$$ \prec\!\prec F \succ\!\succ \; = \int \prod_{a=0}^{n} d^P u^a\; e^{-2\pi^2 M |u^a|^2}\, F(\{u^a\}) \prod_{a=0}^{n} \prod_{i=1}^{P} \hat{f}(u_i^a) . \tag{93} $$

Notice that

$$ \prec\!\prec 1 \succ\!\succ \; = 2^{-P(n+1)} . \tag{94} $$

This property comes from the fact that $f(x) + f(-x) = 1$, and so $f(x) = 1/2 + a(x)$,
where $a(x)$ is an odd function bounded by $-\frac{1}{2}$ and $\frac{1}{2}$. Notice that, although this will not be
required here, in the physically reasonable cases $a(x)$ is an increasing and almost everywhere
continuous function, $0 < a(x) < \frac{1}{2}$, $x > 0$, such that $a(x) \to \frac{1}{2}$ when $x \to \infty$.
In terms of these transfer-averages, the moments read: 

oo 

« pn+l >>= 2 -P(n+l) + £ £ c t l]::: t, A tx, ; .^ > (95) 



2<si<...<s A 
s 1 t 1 +...+s x t x =. 



where 
rtu..,t, _ t^T ( NM , H+ - +H (Trm^Jt (^[(g)'*])'* (qk\ 
^--^ h\...t x \ [ 2 > s *i - A ^ 



and 



A iw A A = ^ (^[(U/JV)* 1 ])' 1 ... (Tr[(U/iV) SA ])' A • (97) 

These transfer-averages have a very simple expression in the thermodynamic limit, which
allows us to rearrange the whole expression in a convenient way. First, we must notice that:

$$ \int_{-\infty}^{\infty} du\; e^{-2\pi^2 M u^2}\, u^{2r}\, \hat{f}(u) = \frac{1}{2}\, \delta_{0r} \tag{98} $$

because $f(x) = \frac{1}{2} + a(x)$ and $a(x)$ is odd. We now prove a factorization property, eq. (103),
of the transfer-averages of traces of $\mathbf{U}$ that will be useful to compute $\ll p^{n+1} \gg$.
The trace of the r-th power of $\mathbf{U}$ can be written as:

$$ \mathrm{Tr}(\mathbf{U}^r) = \sum_{a_1, \ldots, a_r = 0}^{n} \; \sum_{i_1, \ldots, i_r = 1}^{P} u_{i_1}^{a_1} u_{i_1}^{a_2}\, u_{i_2}^{a_2} u_{i_2}^{a_3} \cdots u_{i_r}^{a_r} u_{i_r}^{a_1} . \tag{99} $$

After taking the transfer-average on this expression, one should notice that the contribution
of each term does not depend on the particular indices i's and a's present in that term. It
only depends on the number of different variables in the term and the power of each variable.
This is due to the independence and permutation symmetry of the u's. It is possible
to rearrange eq. (99), expressing it as the sum of each different contribution times a
combinatorial factor. This factor is the number of terms giving that particular contribution. Since
the contributions themselves are of order one, the thermodynamic limit is determined by
the combinatorial factors. In this limit, P → ∞ (r kept finite) and the combinatorial factors
scale as P raised to the number of non-repeated indices i. Then, no more than two u's can
be equal. Defining

$$ A(M) = \int du\; e^{-2\pi^2 M u^2}\, u\, \hat{f}(u) \tag{100} $$

and considering eq. (98) for r = 0 and r = 1, the transfer-average of eq. (99) can be
expressed in terms of A:



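$$ \prec\!\prec \mathrm{Tr}\left[ (\mathbf{U}/N)^r \right] \succ\!\succ \; = 2^{-P(n+1)}\, \chi_r , \tag{101} $$

where

$$ \chi_r = \left( n^r + (-1)^r n \right) (2A)^{2r} \alpha^r . \tag{102} $$

By similar arguments, one can prove a useful factorization property. For the product of
two traces we have:

$$ \prec\!\prec \mathrm{Tr}\left[ (\mathbf{U}/N)^r \right] \mathrm{Tr}\left[ (\mathbf{U}/N)^s \right] \succ\!\succ \; = 2^{-P(n+1)}\, \chi_r\, \chi_s \tag{103} $$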
and a similar factorization holds for the product of an arbitrary number of traces. Recalling 



eq. (95), this property allows us to write: 



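$$ \ll p^{n+1} \gg \; = 2^{-P(n+1)} \exp\left[ N \sum_{m=1}^{\infty} \frac{(-4\pi^2 M)^m}{2m}\, \mathrm{Tr}\left[ (\mathbf{G})^m \right] \chi_m \right] . \tag{104} $$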



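Substituting the explicit values of the $\chi$'s, eq. (102), and performing the sum, we have:

$$ \ll p^{n+1} \gg \; = 2^{-P(n+1)}\; e^{-\frac{1}{2} \mathrm{Tr}\left[ \ln\left( \mathrm{Id}_{N\times N} + 16\pi^2 A^2 n P \mathbf{M} \right) \right]}\; e^{-\frac{n}{2} \mathrm{Tr}\left[ \ln\left( \mathrm{Id}_{N\times N} - 16\pi^2 A^2 P \mathbf{M} \right) \right]} . \tag{105} $$

These moments can be expressed in a more useful way. Defining k by:

$$ k = \int_{-\infty}^{\infty} dy\; y\, e^{-\frac{y^2}{2}}\, f\left( \sqrt{M}\, y \right) , \tag{106} $$

we have $A^2 = -\frac{k^2}{8\pi^3 M}$. Using this relation in eq. (105), we finally obtain eq. (|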
List of Figures

1 ln I vs. ln α computed with the RSA for the deterministic transfer function and the HTT
function, for several values of β_0:

   1. Solid line: β_0 = ∞ (deterministic)
   2. Dotted line: β_0 = 10 (near to deterministic)
   3. Light dashed line: β_0 = 1
   4. Dashed line: β_0 = 0.1 (not far from full stochasticity)

2 ln I vs. ln α for the RSA solution (solid line) and for the analytical solution
(dashed line), for the deterministic transfer function